Abstract

We describe a data structure that supports access, rank and select operations on a dynamic string $S[1,n]$ over alphabet $[1..\sigma]$ in worst-case time $O(\lg n / \lg\lg n)$, which is optimal. Symbols can be inserted into and deleted from $S$ in $O(\lg n / \lg\lg n)$ amortized time. Those times are better than the best previous dynamic time complexities by an $O(1 + \lg\sigma / \lg\lg n)$ factor. Our structure uses $nH_0(S) + o(n(1 + H_0(S))) + O(\sigma(\lg\sigma + \lg^{1+\varepsilon} n))$ bits, where $H_0(S)$ is the zero-order entropy of $S$ and $0 < \varepsilon < 1$ is any constant. This space redundancy over $nH_0(S)$ is also better, almost always, than that of the best previous dynamic structures, $o(n \lg\sigma) + O(\sigma(\lg\sigma + \lg n))$. We can also handle unbounded alphabets in optimal time, which has been an open problem in dynamic sequence representations.
1 Introduction



String representations supporting rank and select queries are fundamental in many data structures, including full-text indexes [TTJ Q3J Q2], permutations [H3E], inverted indexes [3], graphs [T2"], document retrieval indexes [31], labeled trees |151 [5], binary relations [5], and many more. The problem is to encode a string $S[1,n]$ over alphabet $\Sigma = [1..\sigma]$ so as to support the following queries:

rank_a(S, i) = number of occurrences of $a \in \Sigma$ in $S[1,i]$, for $1 \le i \le n$.
select_a(S, i) = position in $S$ of the $i$-th occurrence of $a \in \Sigma$, for $1 \le i \le$ rank_a(S, n).
access(S, i) = $S[i]$.
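
As a point of reference, the three queries can be sketched with plain linear scans. This is only an illustrative baseline under the assumption that S is a Python sequence and positions are 1-based as in the definitions above; the function names are chosen here for illustration and each call takes O(n) time, unlike the structures discussed in this paper.

```python
def access(S, i):
    """Return S[i], with 1-based positions as in the text."""
    return S[i - 1]


def rank(S, a, i):
    """Number of occurrences of symbol a in S[1..i]."""
    return sum(1 for c in S[:i] if c == a)


def select(S, a, i):
    """1-based position of the i-th occurrence of a in S, or None if absent."""
    seen = 0
    for pos, c in enumerate(S, start=1):
        if c == a:
            seen += 1
            if seen == i:
                return pos
    return None
```

Note that rank and select are inverse in the sense that rank(S, a, select(S, a, i)) equals i whenever the i-th occurrence exists.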

There exist various representations of $S$ that support these operations |17 [ 115 1 fTH ], [3]. However, these representations are static, that is, $S$ cannot change. In various applications one needs dynamism, that is, to insert and delete symbols in $S$. There are not many dynamic solutions, however. All are based on the wavelet tree representation [17]. The wavelet tree decomposes $S$ hierarchically. In a first level, it separates larger from smaller symbols, by marking in a bitmap which symbols of $S$ were larger and which were smaller. The two subsequences of $S$ are recursively separated. The $\lg\sigma$ levels of bitmaps describe $S$, and access, rank and select operations on $S$ are carried out via $\lg\sigma$ rank and select operations on the bitmaps. Insertions and deletions in $S$ can also be carried out by inserting and deleting bits from $\lg\sigma$ bitmaps (see Section 2 for more details).

In the static case, rank and select operations on bitmaps take constant time, and therefore access, rank and select on $S$ take $O(\lg\sigma)$ time [17]. This can be reduced to $O(1 + \lg\sigma / \lg\lg n)$ by using multiary wavelet trees [13]. These separate the symbols into $\rho = o(\lg n)$ ranges, and instead of a bitmap store a sequence over an alphabet of size $\rho$. In the dynamic case, however, the operations on those bitmaps or sequences are slowed down. Mäkinen and Navarro [22] obtained $O(\lg\sigma \lg n)$ time for all the operations, including updates, by using dynamic bitmaps that handled all the operations in time $O(\lg n)$. They simultaneously compress the sequence to $nH_0(S) + o(n \lg\sigma)$ bits. Here $H_0(S) = \sum_{a \in [1..\sigma]} (n_a/n) \lg(n/n_a) \le \lg\sigma$ is the zero-order entropy of $S$, where $n_a$ is the number of occurrences of $a$ in $S$. González and Navarro [16] improved the times to $O((1 + \lg\sigma / \lg\lg n) \lg n)$ by extending the results to multiary wavelet trees. In this case, instead of dynamic bitmaps, they handled dynamic sequences over a small alphabet (of size $\rho$). Finally, He and Munro [19] and Navarro and Sadakane [25] obtained the currently best result, $O((1 + \lg\sigma / \lg\lg n) \lg n / \lg\lg n)$ time, still within the same space. They did so by improving the times of the dynamic sequences on small alphabets to $O(\lg n / \lg\lg n)$, which is optimal even on bitmaps [14]. The $\Omega((\lg n / \lg\lg n)^2)$ lower bound for dynamic range counting [28], and the $O(\lg n / \lg\lg n)$ static upper bound using wavelet trees [8], suggest that no more improvements are possible in this line.
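
The zero-order entropy used in the space bounds above can be computed directly from the symbol frequencies. The following is a minimal sketch, assuming S is a non-empty Python sequence; the function name is hypothetical.

```python
from collections import Counter
from math import log2


def zero_order_entropy(S):
    """H_0(S) = sum over symbols a of (n_a / n) * lg(n / n_a), in bits per symbol."""
    n = len(S)
    return sum((n_a / n) * log2(n / n_a) for n_a in Counter(S).values())
```

For a string in which all sigma symbols are equally frequent the value reaches lg(sigma), and for a string of one repeated symbol it is 0, which is why n*H_0(S) can be much smaller than n*lg(sigma).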

In this paper we show that this dead end can be broken by abandoning the implicit assumption that, to provide access, rank and select on $S$, we must provide rank and select on the bitmaps (or sequences over $[1..\rho]$). We show that all that is needed is to track positions of $S$ downwards and upwards along the wavelet tree. It turns out that this tracking can be done in constant time per level, breaking the $O(\lg n / \lg\lg n)$ per-level barrier. As a result, we obtain the optimal time complexity $O(\lg n / \lg\lg n)$ for the three queries. The update time is also optimal, $O(\lg n / \lg\lg n)$, yet it is amortized. This is $\Theta(\lg\sigma / \lg\lg n)$ times faster than what was believed to be the "ultimate" solution. Moreover, we also improve upon the space by compressing the redundancy of $o(n \lg\sigma)$ of previous dynamic structures. Our space is $nH_0(S) + o(n(1 + H_0(S))) + O(\sigma(\lg\sigma + \lg^{1+\varepsilon} n))$ bits, for any constant $0 < \varepsilon < 1$. Finally, we also handle unbounded alphabets (e.g., $\Sigma = \mathbb{R}$) in optimal time $O(\lg\sigma + \lg n / \lg\lg n)$, a long-standing problem on dynamic sequences. At the end we also describe a number of applications where our result offers new time/space combinations.
2 The Wavelet Tree



Let $S$ be a string over alphabet $\Sigma = [1..\sigma]$. We associate each $a \in \Sigma$ to a leaf $v_a$ of a full balanced binary tree $T$. The essential idea of the wavelet tree structure is the representation of elements from a string $S$ by bit sequences stored in the nodes of tree $T$. We associate a subsequence $S(v)$ of $S$ with every node $v$ of $T$. For the root $v_r$, $S(v_r) = S$. In general, $S(v)$ consists of all occurrences of symbols $a \in \Sigma_v$ in $S$, where $\Sigma_v$ is the set of symbols assigned to leaf descendants of $v$. The wavelet tree does not store $S(v)$ explicitly, but just a bit vector $B(v)$. We set $B(v)[i] = t$ if the $i$-th element of $S(v)$ also belongs to $S(v_t)$, where $v_t$ is the $t$-th child of $v$ (the left child corresponds to $t = 0$ and the right to $t = 1$). This data structure (i.e., $T$ and bit vectors $B(v)$) is called a wavelet tree.

For any symbol $S[i] = a$ and every node $v$ such that $a \in \Sigma_v$, there is exactly one bit $b_v$ in $B(v)$ that indicates in which subtree of $v$ the leaf $v_a$ is stored. We will say that such $b_v$ encodes $S[i]$ in $B(v)$; we will also say that bit $b_v$ from $B(v)$ corresponds to a bit $b_w$ from $B(w)$ if both $b_v$ and $b_w$ encode the same symbol $S[i]$. Identifying the positions of bits that encode the same symbol plays a crucial role in wavelet trees. Other, more complex, operations rely on our ability to navigate in the tree and keep track of bits that encode the same symbol.

To implement access(S, i) we traverse a path from the root to the leaf $v_{S[i]}$. In each visited node we read the bit $b_v$ that encodes $S[i]$ and proceed to the $b_v$-th child of $v$. To answer select_a(S, i), we identify the index of the bit $b_{v_r}$ in $S(v_r)$ that corresponds to $S(v_a)[i]$. To compute rank_a(S, i), we identify the last bit $b'$ that precedes $B(v_r)[i]$ and corresponds to some symbol in $S(v_a)$.

The standard method used in wavelet trees for identifying the corresponding bits is to maintain rank/select data structures on bit vectors $B(v)$. Suppose that $B(v)[j] = t$; we can find the position of the corresponding bit in the child of $v$ by answering a query rank_t(B(v), j). If $v$ is the $r$-th child of a node $w$, we can find the corresponding bit in $w$ by answering a query select_r(B(w), j). This approach leads to $O(\lg\sigma)$ query times in the static case because rank/select queries on a bit vector can be answered in constant time. However, we need $\Omega(\lg n / \lg\lg n)$ time to support rank/select and updates on a bit vector [14], which multiplies the operation times. A slight improvement can be achieved by increasing the fan-out of the wavelet tree to $\Theta(\lg^{\varepsilon} n)$: as before, $B(v)[i] = t$ if the $i$-th element of $S(v)$ also belongs to $S(v_t)$ for the $t$-th child $v_t$ of $v$. This enables us to reduce the height of the wavelet trees and the query time by a $\Theta(\lg\lg n)$ factor. However, it seems that further improvements that are based on dynamic rank/select queries in every node are not possible.
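
The standard rank/select navigation described above can be sketched as a small static binary wavelet tree. This is an illustrative sketch, not the structure proposed in this paper: it assumes integer symbols, a non-empty input, and it replaces the constant-time rank/select structures on each bit vector by linear scans (the helper names bit_rank and bit_select are hypothetical).

```python
def bit_rank(bits, t, j):
    """Occurrences of bit t among the first j bits (stand-in for a rank structure)."""
    return sum(1 for b in bits[:j] if b == t)


def bit_select(bits, t, r):
    """1-based position of the r-th bit equal to t (stand-in for a select structure)."""
    for pos, b in enumerate(bits, start=1):
        if b == t:
            r -= 1
            if r == 0:
                return pos
    raise ValueError("fewer than r occurrences")


class WaveletTree:
    def __init__(self, S, lo=None, hi=None):
        if lo is None:
            lo, hi = min(S), max(S)
        self.lo, self.hi, self.n = lo, hi, len(S)
        self.bits = None                      # leaves store no bitmap
        if lo < hi:
            mid = (lo + hi) // 2
            self.bits = [0 if c <= mid else 1 for c in S]
            self.child = [WaveletTree([c for c in S if c <= mid], lo, mid),
                          WaveletTree([c for c in S if c > mid], mid + 1, hi)]

    def _side(self, a):
        return 0 if a <= (self.lo + self.hi) // 2 else 1

    def access(self, i):
        node = self
        while node.bits is not None:
            t = node.bits[i - 1]              # bit that encodes S[i] in this node
            i = bit_rank(node.bits, t, i)     # corresponding position in child t
            node = node.child[t]
        return node.lo

    def rank(self, a, i):
        node = self
        while node.bits is not None:
            t = node._side(a)
            i = bit_rank(node.bits, t, i)
            node = node.child[t]
        return i

    def select(self, a, r):
        if self.bits is None:
            if r > self.n:
                raise ValueError("fewer than r occurrences")
            return r
        t = self._side(a)
        # Map the position found in child t back up to this node.
        return bit_select(self.bits, t, self.child[t].select(a, r))
```

Each query touches one node per level, so with constant-time bitmap operations the cost would be proportional to the tree height, as stated above.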

In this paper we use a different approach to identifying the corresponding bits. Instead of storing a bitmap $B(v)$ in an array, we use a compact list structure $L(v)$ to store $B(v)$. Pointers from selected positions in $L(v)$ to the structure $L(w)$ in a parent node $w$ (and vice versa) enable us to navigate between nodes of the wavelet tree in constant time. We extend the idea to multiary wavelet trees. While similar techniques have been used in some geometric data structures [26, 7], applying them on compressed data structures where the bit budget is severely limited is much more challenging and requires new ideas.

3 Basic Structure 

We start by describing the main components of our modified wavelet tree. Then, we show how our structure supports access(S, i) and select_a(S, i). In the third part of this section we describe additional structures that enable us to answer rank_a(S, i). Finally, we show how to support updates.

Structure. We assume that the wavelet tree $T$ has node degree $\rho = \Theta(\lg^{\varepsilon} n)$. We divide sets $B(v)$ into blocks and store those blocks in a doubly-linked list $L(v)$. Each block $G_i(v)$, except the last one, contains $\Theta(\lg^3 n)$ consecutive elements from $B(v)$; the last block contains $O(\lg^3 n)$ consecutive elements. For each $G_i(v)$ we maintain a data structure $R_i(v)$ that supports rank and select queries on elements of $G_i(v)$. Since a block contains a poly-logarithmic number of elements over an alphabet of size $\rho$, we can answer rank and select queries in $O(1)$ time using $O(|G_i(v)| / \lg^{1-\varepsilon} n)$ additional bits (see Appendix A for details). A pointer to an element $B(v)[e]$ consists of two parts: a unique id of the block $G_i(v)$ that contains $e$ and the index of $e$ in $G_i(v)$. We maintain pointers between selected corresponding elements in $L(v)$ and its children. If an element $B(v)[e] = t$ is stored in a block $G_i(v)$ and $B(v)[e'] \ne t$ for any $e'$ in $G_i(v)$ that precedes $e$, then we store a pointer from $e$ to the corresponding element $B(v_t)[e_t]$ in $L(v_t)$, where $v_t$ is the $t$-th child of $v$. If $B(v)[e] = t$ and the corresponding $e_t$ in $L(v_t)$ is the first element in its block, then we also store a pointer from $e$ to $e_t$. If there is a pointer from $e$ in $L(v)$ to $e_t$ in $L(v_t)$, then we also store a pointer from $e_t$ to $e$. All these pointers will be called inter-node pointers. We describe how inter-node pointers are implemented later in this section.

It is easy to see that the number of inter-node pointers from $e$ in $L(v)$ to $e_t$ in $L(v_t)$ is $\Theta(g(v))$, where $g(v)$ is the number of blocks in $L(v)$. Hence, the total number of pointers that point down from a node $v$ is bounded by $O(g(v)\rho)$. Since this also equals the number of pointers that point up to $v$, the total number of pointers in the wavelet tree equals $O(\sum_{v \in T} g(v)\rho) = O(n \lg\sigma / \lg^{3-\varepsilon} n + \sigma \lg^{\varepsilon} n)$. The pointers from a block $G_i(v)$ are stored in a data structure $F_i(v)$. Using $F_i(v)$, we can find, for any element $e$ in $G_i(v)$ and any $1 \le t \le \rho$, the last $e'$ preceding $e$ in $G_i(v)$ such that there is a pointer from $e'$ to an element $e'_t$ in $L(v_t)$. We describe in Appendix A how $F_i(v)$ implements the queries in constant time.

For the root node $v_r$, we store a partial-sum data structure $V(v_r)$ that contains the number of elements in each block of $L(v_r)$. Using $V(v_r)$, we can find the block $G_j(v_r)$ that contains the $i$-th element of $S(v_r) = S$, as well as the number of elements in all blocks that precede a given block $G_j(v_r)$. Both operations can be supported in $O(\lg n / \lg\lg n)$ time [19, 25]. The same data structures $V(v_a)$ are also stored in the leaves $v_a$ of $T$. We observe that we do not store a bitmap $B(v_a)$ in a leaf node $v_a$. Nevertheless, we divide the (implicit) sequence $B(v_a)$ into blocks and store the number of elements in each block in $V(v_a)$; we maintain $V(v_a)$ only if $L(v_a)$ consists of more than one block. Moreover we store inter-node pointers from the parent of $v_a$ to $v_a$ and vice versa. Pointers in a leaf node are constructed according to the same rules as in any other node.

Access and Select Queries. Suppose that the position of an element $B(v)[e] = t$ in $L(v)$ is known, and let $i_v$ be the index of $e$ in its block $G_j(v)$. Then the position of the corresponding $e_t$ in $L(v_t)$ is computed as follows. Using $F_j(v)$, we find the index $i'$ of the largest $e' \le e$ in $G_j(v)$ such that there is a pointer from $e'$ to $e'_t$ in $L(v_t)$. Due to our construction, such $e'$ must exist. Let $i'_v$ and $i'_t$ denote the indexes of $e'$ and $e'_t$ respectively, and let $G(v_t)$ denote the block that contains $e'_t$. Let $r_v$ = rank_t($G_j(v)$, $i_v$) and $r'_v$ = rank_t($G_j(v)$, $i'_v$). Due to our rules to define pointers, $e_t$ also belongs to $G(v_t)$ and its index is $i'_t + (r_v - r'_v)$. Thus we can find the position of $e_t$ in $O(1)$ time if the position of $B(v)[e] = t$ is known.

Analogously, assume we know a position $B(v_t)[e_t]$ at $G_j(v_t)$ and want to find the corresponding position $e$ in its parent node $v$. Using $F_j(v_t)$ we find the last $e'_t \le e_t$ in $G_j(v_t)$ that has a pointer to its parent, which exists by construction. Let $e'_t$ point to $e'$, with index $i'_v$ in a block $G(v)$. Let $i'_t$ and $i_t$ be the indexes of $e'_t$ and $e_t$ in $G_j(v_t)$, respectively. Then, by our construction, $e$ is also in $G(v)$ and its index is select_t($G(v)$, rank_t($G(v)$, $i'_v$) + $(i_t - i'_t)$).
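
The index arithmetic of both tracking steps can be illustrated on a single block. The sketch below assumes that the nearest pointer has already been found (the role of the F structures) and is passed in as a pair of 1-based indexes (anchor_v in the parent block, anchor_t in the child block); rank and select inside the block are done by scanning, standing in for the constant-time structure R. All function and parameter names are hypothetical.

```python
def rank_t(G, t, i):
    """Occurrences of symbol t among the first i elements of block G."""
    return sum(1 for x in G[:i] if x == t)


def select_t(G, t, r):
    """1-based index of the r-th occurrence of t in block G."""
    for pos, x in enumerate(G, start=1):
        if x == t:
            r -= 1
            if r == 0:
                return pos
    raise ValueError("fewer than r occurrences")


def track_down(G, t, i_v, anchor_v, anchor_t):
    """Index in the child block of the element corresponding to G[i_v] = t.

    Implements i'_t + (r_v - r'_v) from the text."""
    return anchor_t + (rank_t(G, t, i_v) - rank_t(G, t, anchor_v))


def track_up(G, t, i_t, anchor_t, anchor_v):
    """Index in parent block G of the element corresponding to child index i_t.

    Implements select_t(G, rank_t(G, i'_v) + (i_t - i'_t)) from the text."""
    return select_t(G, t, rank_t(G, t, anchor_v) + (i_t - anchor_t))
```

For example, with G = [0, 1, 1, 0, 1, 0, 1], t = 1 and an anchor pointer from parent index 2 to child index 10, the element at parent index 5 maps to child index 12, and child index 12 maps back to parent index 5.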

To solve access(S, i), we visit nodes $v_0 = v_r, v_1, \ldots, v_p = v_a$, where $v_j$ is the $t_j$-th child of $v_{j-1}$ and $B(v_{j-1})[e_{j-1}] = t_j$ encodes $S[i]$. The position of $e_0$ is found in $O(\lg n / \lg\lg n)$ time using the dynamic partial-sums data structure $V(v_r)$. If the position of $e_{j-1}$ is known, we can find that of $e_j$ in $O(1)$ time, as explained above. When a leaf node $v_p = v_a$ is reached, we know that $S[i] = a$.

To solve select_a(S, i), we identify the $i$-th element in the leaf $v_a$, using structure $V(v_a)$. Let $e_p$ be its position. Then we traverse the path $v_p, v_{p-1}, \ldots, v_0 = v_r$ where $v_{j-1}$ is the parent of $v_j$, until the root node is reached. In every node $v_j$, we find $e_{j-1}$ in $L(v_{j-1})$ that corresponds to $e_j$ as explained above. Finally, we compute the number of elements that precede $e_0$ in $L(v_r)$ using structure $V(v_r)$. Clearly, access and select require $O((\lg\sigma + \lg n) / \lg\lg n)$ worst-case time.

Rank Queries. We need some additional data structures for efficient support of rank queries. In every node $v$ such that $L(v)$ consists of more than one block, we store a data structure $P(v)$. Using $P(v)$ we can find, for any $1 \le t \le \rho$ and for any block $G_j(v)$, the last block $G_i(v)$ that precedes $G_j(v)$ and contains an element $B(v)[e] = t$. $P(v)$ consists of $\rho$ predecessor data structures $P_t(v)$ for $1 \le t \le \rho$. These $P_t(v)$s could be implemented as van Emde Boas structures, but this would slow down rank to $O(\lg\sigma + \lg n / \lg\lg n)$ time. Instead, we describe in Section 4 a way to support the predecessor queries in constant time.

Suppose that $e$ is the $j$-th element in a block $G_i(v)$. $P(v)$ enables us to find the position of the last $B(v)[e'] = t$ that precedes $e$ in $B(v)$. First, we use $R_i(v)$ to compute $r$ = rank_t($G_i(v)$, $j$). If $r > 0$, then $e'$ belongs to the same block as $e$ and its index in the block $G_i(v)$ is select_t($G_i(v)$, $r$). Otherwise, we use $P_t(v)$ to find the last block $G_k(v)$ that precedes $G_i(v)$ and contains an element $B(v)[e'] = t$. We then find the last such element in $G_k(v)$ using $R_k(v)$.
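
The two cases of this search can be sketched as follows. The sketch assumes the list of a node is given as a Python list of blocks, uses 0-based block numbers and 1-based indexes inside a block, and replaces both R (in-block rank/select) and the predecessor structure P_t by simple scans; the names are hypothetical.

```python
def rank_t(G, t, j):
    """Occurrences of t among the first j elements of block G."""
    return sum(1 for x in G[:j] if x == t)


def select_t(G, t, r):
    """1-based index of the r-th occurrence of t in block G."""
    for pos, x in enumerate(G, start=1):
        if x == t:
            r -= 1
            if r == 0:
                return pos
    raise ValueError("fewer than r occurrences")


def last_occurrence(blocks, t, i, j):
    """Locate the last element equal to t among the first j elements of block i
    and all earlier blocks. Returns (block number, index in block) or None."""
    r = rank_t(blocks[i], t, j)
    if r > 0:                                  # same block: answer is the r-th t
        return i, select_t(blocks[i], t, r)
    for k in range(i - 1, -1, -1):             # stand-in for the query on P_t(v)
        total = rank_t(blocks[k], t, len(blocks[k]))
        if total > 0:
            return k, select_t(blocks[k], t, total)
    return None
```

In the actual structure the backward loop over blocks is what P_t(v) replaces with a constant-time predecessor query.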

Now we are ready to describe the procedure to answer rank_a(S, i). The symbol $a$ is represented as a concatenation of symbols $t_0 \circ t_1 \circ \ldots \circ t_p$, where each $t_j$ is between 1 and $\rho$. We traverse the path from the root $v_r = v_0$ to the leaf $v_a = v_p$. We find the position $e_0$ of the $i$-th element of $S$ in $v_r$ using the data structure $V(v_r)$. In each node $v_j$, $0 \le j < p$, we identify the position of the last element $B(v_j)[e'_j] = t_j$ that precedes $e_j$, using $P_{t_j}(v_j)$. Then we find the element $e_{j+1}$ in the list $L(v_{j+1})$ that corresponds to $e'_j$.

When our procedure reaches the leaf node $v_p$, the element $e_p$ encodes the last symbol $a$ that precedes $S[i]$. Since we know the position of $e_p$, we know its index $k_1$ in its block $G_k(v_p)$. We can find the number $k_2$ of symbols in all blocks that precede $G_k(v_p)$ using $V(v_p)$. Clearly, rank_a(S, i) = $k_1 + k_2$. Because structures $P_t$ answer in time $O(1)$, the overall time for rank is $O((\lg\sigma + \lg n) / \lg\lg n)$.

Updates. Now we describe how inter-node pointers are implemented. We say that an element of $L(v)$ is pointed if there is a pointer to this element. Unfortunately, we cannot store the position of a pointed element in the pointer: when a new element is inserted into a block, indexes of all elements that follow it are incremented by 1. Since a block can contain $\Theta(\lg^3 n)$ pointed elements, we would have to update up to $\Theta(\lg^3 n)$ pointers after each insertion and deletion.

Therefore we resort to the following two-level scheme. Each pointed element in a block is assigned a unique id. When a new element is inserted, we assign it the id max-id + 1, where max-id is the maximum id value used so far. We also maintain a data structure $H_i(v)$ for each block $G_i(v)$ that enables us to find the position of a pointed element if its id in $G_i(v)$ is known. Implementation of $H_i(v)$ is based on standard word RAM techniques and a table that contains ids of the pointed elements; details are given in Appendix A.
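
The point of the two-level scheme is that a pointer holding a (block, id) pair stays valid when other elements are inserted in front of its target. The following minimal sketch shows this with a hypothetical Block class; the lookup position_of plays the role of H_i(v) but is implemented as a linear scan rather than with word RAM techniques.

```python
class Block:
    """A block whose pointed elements carry stable ids instead of positions."""

    def __init__(self):
        self.elems = []      # list of (symbol, id or None), in block order
        self.max_id = 0      # largest id handed out so far in this block

    def insert(self, index, symbol, pointed=False):
        """Insert symbol at 0-based index; return its id if it is pointed."""
        eid = None
        if pointed:
            self.max_id += 1
            eid = self.max_id
        self.elems.insert(index, (symbol, eid))
        return eid

    def position_of(self, eid):
        """Current 0-based index of the pointed element with the given id."""
        for idx, (_, e) in enumerate(self.elems):
            if e == eid:
                return idx
        raise KeyError(eid)
```

After inserting two unpointed elements in front of a pointed one, position_of still resolves the old id, now to an index shifted by two, without any pointer having been rewritten.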

We describe now how to insert a new symbol $a$ into $S$ at position $i$. Let $e_0, \ldots, e_p$ be the elements that will encode $a = t_0 \circ \ldots \circ t_p$. We can find the position of $e_0 = i$ in $L(v_r)$ in $O(\lg n / \lg\lg n)$ time using $V(v_r)$, and insert $t_0$ at that position, $B(v_r)[e_0] = t_0$. Now, suppose that the position of $B(v_j)[e_j] = t_j$ in $L(v_j)$ is known. Then we can find the last element $B(v_j)[e'] = t_j$ that precedes $e_j$ in $L(v_j)$; this can be done in the same way as in the procedure for answering rank queries. Once we know the position of $e'$ in $L(v_j)$, we can find the position of $e''$ in $L(v_{j+1})$ that corresponds to $e'$. The element $t_{j+1}$ must be inserted into $L(v_{j+1})$ immediately after $e''$, at position $e_{j+1} = e'' + 1$.

When a new element $B(v_j)[e_j] = t$ is inserted into $L(v_j)$, we update, if necessary, auxiliary data structures $H_k(v_j)$, $F_k(v_j)$, $R_k(v_j)$, and $P_t(v_j)$, where $G_k(v_j)$ is the block that contains $e_j$. These updates take $O(1)$ time; see Section 4 for structure $P_t(v_j)$. Since pointers are bidirectional, changes to $F_k(v_j)$ trigger changes in the $F$ structures of $v_{j-1}$ and $v_{j+1}$. Finally, if $v_j$ is the root node or a leaf, we also update $V(v_j)$.

If the number of elements in $G_k(v_j)$ exceeds $2\lg^3 n$, we split $G_k(v_j)$ into two blocks, $G_{k_1}(v_j)$ and $G_{k_2}(v_j)$. The sizes of both blocks differ by at most one element. Then, we rebuild the data structures $H$, $F$ and $R$ for the two new blocks. Note that there are inter-node pointers to $G_k(v_j)$ that now could become dangling pointers, but all those can be known from $F_k(v_j)$, since pointers are bidirectional, and updated to point to the right places in $G_{k_1}(v_j)$ or $G_{k_2}(v_j)$. Finally, if $v_j$ is the root or a leaf, then $V(v_j)$ is updated.

The total cost of splitting a block is dominated by the cost of building the new data structures $H$, $F$ and $R$. We need $O(\lg^3 n)$ time to rebuild them. It is easy to check that we split a block $G_k(v)$ at most once for a sequence of $\Theta(\lg^3 n)$ insertions in $G_k(v)$. Hence, the amortized cost incurred by splitting a block is $O(1)$. Therefore the total cost of an insertion in $L(v)$ is $O(1)$. The insertion of a new symbol leads to $O(\lg\sigma / \lg\lg n)$ insertions into lists $L(v)$. Updates of data structures $V(v_r)$ and $V(v_a)$ take $O(\lg n / \lg\lg n)$ time. Hence, the total cost of an insertion is $O((\lg\sigma + \lg n) / \lg\lg n)$. Deletions are handled symmetrically.
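
The amortization argument for splits can be checked with a small simulation. The sketch below tracks only block sizes, with a parameter B standing in for lg^3 n, and charges a rebuild cost equal to the size of the block being split; the function name, the random choice of the block receiving each insertion, and the seed are assumptions made for illustration.

```python
import random


def simulate_splits(num_inserts, B, seed=0):
    """Insert num_inserts elements into randomly chosen blocks, splitting any
    block that exceeds 2*B elements. Returns (total rebuild work, block sizes)."""
    rng = random.Random(seed)
    sizes = [0]
    work = 0
    for _ in range(num_inserts):
        k = rng.randrange(len(sizes))
        sizes[k] += 1
        if sizes[k] > 2 * B:
            work += sizes[k]                   # rebuilding costs time linear in the block
            half = sizes[k] // 2
            sizes[k:k + 1] = [half, sizes[k] - half]
    return work, sizes
```

Every block produced by a split holds at most B + 1 elements, so at least B further insertions must hit it before it splits again. The total rebuild work is therefore at most (2B + 1)/B times the number of insertions, that is, O(1) amortized per insertion.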

Space. We show in Appendix A how to manage the data in blocks $G_i(v)$ so that all the elements stored in lists $L(v)$ use $n \lg\sigma$ bits. Since there are $O(n \lg\sigma / \lg^3 n + \sigma)$ blocks overall, all the pointers between blocks of the same lists add up to $O(n \lg\sigma / \lg^2 n + \sigma \lg n)$ bits. All data structures $V(v)$ add up to $O(n / \lg^2 n)$ bits. We showed before that the number of inter-node pointers is $O(n \lg\sigma / \lg^{3-\varepsilon} n + \sigma \lg^{\varepsilon} n)$, hence all inter-node pointers (i.e., $F_i$ and $H_i$ structures) use $O(n \lg\sigma / \lg^{2-\varepsilon} n + \sigma \lg^{1+\varepsilon} n)$ bits. Structures $P_t(v)$ (Section 4) use $O(n \lg\sigma / \lg^{2-\varepsilon} n)$ bits as they have $\rho$ integers per block. Finally in Appendix A we show that each structure $R_i(v)$ uses $O(|G_i(v)| / \lg^{1-\varepsilon} n)$ extra bits. Hence, all $R_i(v)$ for all blocks and nodes use $O(n \lg\sigma / \lg^{1-\varepsilon} n)$ bits.

Finally, note that our structures depend on the value of $\lg n$, so they should be rebuilt when $\lceil \lg n \rceil$ changes. Mäkinen and Navarro [22] describe a way to handle this problem without affecting the space or the time complexities, even in the worst-case scenario. The result is completed in the next section, where we describe the changes needed to implement the predecessor structures.

4 Lazy Deletions and Data Structure P(u) 

The main idea of our solution is based on lazy deletions: we do not maintain exactly $S$ but a supersequence $\overline{S}$ of it. When a symbol $a$ is deleted from $S$, we retain it in $\overline{S}$ but take notice that $a$ is deleted. When the number of deleted symbols exceeds a certain threshold, we expunge from the data structure all the elements marked as deleted. We define $\overline{B}(u)$ and the list $\overline{L}(u)$ for the sequence $\overline{S}$ in the same way as $B(u)$ and $L(u)$ are defined for $S$.

Since elements of $\overline{L}(u)$ are not deleted, we can implement $P(u)$ as an insertion-only data structure. For any $t$, $1 \le t \le \rho$, we store information about all blocks in a data structure $P_t(u)$. $P_t(u)$ contains one element for each block $G_i(u)$ and is implemented as an incremental split-find data structure that supports insertions in $O(1)$ amortized time and queries in $O(1)$ time [21]. Using $P_t(u)$, we can find for any $G_i(u)$ the last block preceding $G_i(u)$ and containing an element $e = t$. Hence, we can find for any element $e$ in $\overline{L}(u)$ the position of the last element $\overline{B}(u)[e'] = t$ that precedes $e$. This is done in the same way as in Section 3.

We need some additional data structures to support lazy deletions. A data structure $\overline{V}(u)$ stores the number of undeleted elements in each block of $\overline{L}(u)$ and supports partial-sum queries. We will maintain $\overline{V}(u)$ in the root of the wavelet tree and in all leaf nodes. Moreover, we maintain a data structure $D_i(u)$ for every block $G_i(u)$, where $u$ is either the root or a leaf node. $D_i(u)$ can be used to count the number of deleted and undeleted elements before the $j$-th element in a block $G_i(u)$ for any query index $j$. The implementation of $D_i(u)$ will be described in Appendix A. We can use $\overline{V}(u)$ and $D_i(u)$ to find the position $r$ in $\overline{L}(u)$ where the $j$-th undeleted element occurs or to count the number of undeleted elements that occur before the position $r$ in $\overline{L}(u)$.
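
The interplay between the per-block counts of undeleted elements and the in-block deletion marks can be sketched as follows. This is a simplified illustration with a hypothetical class name: the partial-sum structure and the D structures are both replaced by scans, and elements are never physically removed.

```python
class LazyBlocks:
    """Blocks whose elements are only marked as deleted, never removed."""

    def __init__(self, blocks):
        # Each element is stored as [symbol, deleted_flag].
        self.blocks = [[[x, False] for x in b] for b in blocks]

    def locate(self, j):
        """(block number, 0-based index) of the j-th undeleted element, j >= 1."""
        for k, b in enumerate(self.blocks):
            alive = sum(1 for _, d in b if not d)   # one entry of the partial sums
            if j <= alive:
                for idx, (_, d) in enumerate(b):    # role of the in-block counts D
                    if not d:
                        j -= 1
                        if j == 0:
                            return k, idx
            j -= alive
        raise IndexError("fewer undeleted elements than requested")

    def access(self, j):
        k, idx = self.locate(j)
        return self.blocks[k][idx][0]

    def delete(self, j):
        """Mark the j-th undeleted element as deleted."""
        k, idx = self.locate(j)
        self.blocks[k][idx][1] = True
```

After a deletion, the logical sequence shrinks while the physical blocks keep their length, which is what lets the predecessor structures remain insertion-only.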

We also store a list DEL that contains all deleted symbols that have not yet been expunged from the wavelet tree. For any symbol $S[i]$ in list DEL we store a pointer to the element $e$ in $\overline{L}(v_r)$ that encodes $S[i]$. Pointers in list DEL are implemented in the same way as inter-node pointers.

Queries. Queries are answered very similarly to Section 3. The main idea is that we can essentially ignore deleted elements except at the root and at the leaves.

access(S, i): Exactly as in Section 3, except that $e_0$ is the position of the $i$-th undeleted element in $\overline{L}(v_r)$, found using $\overline{V}(v_r)$.

select_a(S, i): We find the position of the $i$-th undeleted element $e_p$ in $\overline{L}(v_p)$, where $v_p = v_a$, using $\overline{V}(v_a)$. Then we move up in the tree exactly as in Section 3. When the root node $v_0 = v_r$ is reached, we count the number of undeleted elements that precede $e_0$ using $\overline{V}(v_r)$.

rank_a(S, i): We find the position of the $i$-th undeleted element $e_0$ in $\overline{L}(v_0)$ where $v_0 = v_r$. Let $t_j$ be defined as in Section 3. In every node $v_j$, we find the last element $\overline{B}(v_j)[e'_j] = t_j$ that precedes $e_j$. Note that this element may be a deleted one, but it still drives us to the correct position in $\overline{L}(v_{j+1})$. We proceed exactly as in Section 3 until we arrive at a leaf $v_p = v_a$. At this point, we count the number of undeleted elements that precede $e_p$.

Updates. Insertions are supported in the same way as in Section 3. The only difference is that we also update the data structure $D_k(v_j)$ when an element $e_j$ that encodes the inserted symbol $a$ is added to a block $G_k(v_j)$. When a symbol $S[i] = a$ is deleted, we append it to the list DEL of deleted symbols. Then we visit each block $G_k(v_j)$ containing an element $e_j$ that encodes $S[i]$ and update the data structures $D_k(v_j)$. Finally, $\overline{V}(v_r)$ and $\overline{V}(v_a)$ are also updated.

When the number of symbols in the list DEL reaches $n / \lg^2 n$, we perform a cleaning procedure and get rid of all the deleted elements. Therefore DEL never requires more than $O(n / \lg n)$ bits. For every symbol $S[i]$ stored in DEL, we effectively carry out the deletion procedure of Section 3 (recall that the list DEL contains pointers to the positions $e_0$ in $\overline{L}(v_r)$ to delete). Once all symbols in DEL are processed, we rebuild from scratch the data structures $P(u)$ for all nodes $u$. The total size of all $P(u)$ structures is $O(n \lg\sigma \lg^{\varepsilon} n / \lg^3 n)$ elements. Since a data structure for incremental split-find can be constructed in linear time, all $P(u)$ can be rebuilt in $O(n \lg\sigma / \lg^{3-\varepsilon} n)$ time. Hence the amortized time to rebuild the $P(u)$s is $O(\lg\sigma / \lg^{1-\varepsilon} n)$, which does not affect the amortized time $O((\lg\sigma + \lg n) / \lg\lg n)$ to carry out the effective deletions.

Theorem 1 A dynamic string $S[1,n]$ over alphabet $[1..\sigma]$ can be stored in a structure using $n \lg\sigma + O(n \lg\sigma / \lg^{1-\varepsilon} n + \sigma \lg^{1+\varepsilon} n)$ bits, for any constant $0 < \varepsilon < 1$, and supporting queries access, rank and select in time $O((\lg\sigma + \lg n) / \lg\lg n)$. Insertions and deletions of symbols are supported in $O((\lg\sigma + \lg n) / \lg\lg n)$ amortized time.

5 Compressed Space and Optimal Time 

We now compress the space of the data structure to zero-order entropy ($nH_0(S)$ plus redundancy, as defined in the Introduction), while improving the time performance to the optimal $O(\lg n / \lg\lg n)$. We first show how a different encoding of the bits within the blocks reduces the $n \lg\sigma$ to $nH_0(S)$ in the space without affecting the time complexities. Then we use this result in combination with alphabet partitioning [3] to obtain the final result, where the redundancy is also compressed.

5.1 Compressing the Bitmaps 

Raman et al. [29] describe an encoding for a bitmap $B[1,n]$ that obtains $nH_0(B) + O(n \lg\lg n / \lg n)$ bits of space. It consists of cutting the bitmap into chunks of length $b = \lg(n)/2$ and encoding each chunk $i$ as a pair $(c_i, o_i)$: $c_i$ is the class, which indicates how many 1s there are in the chunk, and $o_i$ is the offset, which is the index of this particular chunk within its class. The $c_i$ components add up to $O(n \lg\lg n / \lg n)$ bits, whereas the $o_i$ components add up to $nH_0(B)$. Navarro and Sadakane [25, Sec. 8] describe a technique to maintain a dynamic bitmap in this format. They allow the chunk length $b$ to vary, so they encode triples $(b_i, c_i, o_i)$ and maintain the invariant that $b_i + b_{i+1} > b$ for any pair of consecutive chunks. They show that this retains the space, and that each update affects $O(1)$ chunks.
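
The (class, offset) pair of a single chunk can be computed with the combinatorial number system. The sketch below is one possible way to realize the encoding, assuming chunks are Python lists of 0/1 values and offsets are assigned in lexicographic order within a class; the function names are hypothetical, and a practical implementation would use precomputed tables rather than recomputing binomials.

```python
from math import comb


def encode_chunk(bits):
    """Return (c, o): c is the number of 1s, o the rank of the chunk among all
    chunks of the same length and class, in lexicographic order."""
    b = len(bits)
    c = sum(bits)
    offset, ones_left = 0, c
    for pos, bit in enumerate(bits):
        if bit:
            # Skip every chunk that has a 0 here and ones_left 1s afterwards.
            offset += comb(b - pos - 1, ones_left)
            ones_left -= 1
    return c, offset


def decode_chunk(b, c, offset):
    """Inverse of encode_chunk for a chunk of length b and class c."""
    bits, ones_left = [], c
    for pos in range(b):
        zero_first = comb(b - pos - 1, ones_left)
        if ones_left > 0 and offset >= zero_first:
            bits.append(1)
            offset -= zero_first
            ones_left -= 1
        else:
            bits.append(0)
    return bits
```

Since a class c of length b contains comb(b, c) chunks, the offset fits in about lg comb(b, c) bits, which is what makes the offsets add up to the entropy while the classes account for the redundancy term.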

We extend this encoding to handle an alphabet $[1..\rho]$ [13], so that $b = \lg(n)/2$ symbols, and each chunk is encoded as a tuple $(b_i, c_i^1, \ldots, c_i^{\rho}, o_i)$ where $c_i^t$ counts the occurrences of $t$ in the block. The "classes" $(b_i, c_i^1, \ldots, c_i^{\rho})$ use $O(n \lg\lg n / \lg^{1-\varepsilon} n)$ bits, and the offsets still add up to $nH_0(B)$. Blocks are encoded/decoded in $O(1)$ time, as the class part takes $O(\lg^{\varepsilon} n \lg\lg n) = o(\lg n)$ bits.

In Appendix A we describe how a block is stored as a sequence of miniblocks of $\Theta(\lg n)$ bits, whose length may vary within a constant factor. Those miniblocks, while retaining their "logical" size, will be physically represented using the new encoding, local to each miniblock. As each miniblock will contain a constant number of chunks, it will be processed or updated in constant time. The main difference with respect to the plain representation is that an insertion or deletion may cause the miniblock to grow or shrink by $O(\lg n)$ bits, but we can still handle a constant number of miniblock splits or merges in constant time.

The sum of the local entropies of the chunks, across the whole $L(v)$, adds up to $nH_0(S_v)$, and these add up to $nH_0(S)$ [17]. The redundancy over the entropy is $O(\lg^{\varepsilon} n \lg\lg n)$ bits per miniblock, adding up to $O(n \lg\sigma \lg\lg n / \lg^{1-\varepsilon} n)$ bits.

Theorem 2 A dynamic string $S[1,n]$ over alphabet $[1..\sigma]$ can be stored in a structure using $nH_0(S) + O(n \lg\sigma / \lg^{1-\varepsilon} n + \sigma \lg^{1+\varepsilon} n)$ bits, for any constant $0 < \varepsilon < 1$, and supporting queries access, rank and select in time $O((\lg\sigma + \lg n) / \lg\lg n)$. Insertions and deletions of symbols are supported in $O((\lg\sigma + \lg n) / \lg\lg n)$ amortized time.

5.2 Alphabet Partitioning

The redundancy in Theorem 2 is still a (sublinear) function of $n \lg\sigma$. Now we show how to compress that redundancy as well, while at the same time reducing the time complexities to optimal.

We use a technique inspired by an alphabet partitioning idea [3]. To each symbol $a$ we will assign a level $\ell = \lceil \lg(n/n_a) \rceil$, so that there are at most $\lceil \lg n \rceil$ levels. Additionally, we assign level $\lceil \lg n \rceil + 1$ to the symbols of $\Sigma$ not present in $S$. For each level $\ell$ we will create a sequence $S^{\ell}$ containing the subsequence of $S$ formed by the symbols of level $\ell$, with their alphabet remapped to $[1..\sigma_{\ell}]$, where $\sigma_{\ell}$ is the number of distinct symbols of level $\ell$. We will also maintain a sequence of levels $S^{lev}$, so that $S^{lev}[i]$ is the level of $S[i]$. We represent $S^{lev}$ and the $S^{\ell}$ strings using Theorem 2. A few arrays handle the mapping between global symbols of $\Sigma$ and local symbols in strings $S^{\ell}$: $M[1,\sigma]$ gives the level of each symbol, $N[1,\sigma]$ gives the position of that symbol inside the local alphabet of its level, and local arrays $M^{\ell}[1,\sigma_{\ell}]$ map local to global symbols. All these are represented as plain arrays. Thus a symbol $a \in \Sigma$ is represented in string $S^{\ell}$, at level $\ell = M[a]$, where it is written as symbol $a' = N[a]$. Conversely, a symbol $a'$ in $S^{\ell}$ corresponds to symbol $a = M^{\ell}[a'] \in \Sigma$.

Barbay et al. [3] show how operations access, rank, and select on $S$ are carried out via a constant number of operations in $S^{lev}$ and in some $S^{\ell}$. We now extend them to insertions and deletions. To insert symbol $a$ at position $i$ in $S$, we find its level $\ell = M[a]$ and its translation $a' = N[a]$ inside $S^{\ell}$. Now we insert $\ell$ at position $i$ in $S^{lev}$, and $a'$ at position rank_$\ell$($S^{lev}$, $i$) in $S^{\ell}$. Deletion is similar: after mapping, we delete the position $S^{\ell}$[rank_$\ell$($S^{lev}$, $i$)] and then the position $S^{lev}[i]$.
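
The mapping between S, the level sequence and the per-level strings can be sketched as follows. This is an illustration under simplifying assumptions: all sequences are plain Python lists with linear-time rank and select instead of the structures of Theorem 2, levels are fixed at construction time (no level changes), insert only handles symbols that already occur, and the class and attribute names (PartitionedString, S_lev, S_l, M_loc) are hypothetical.

```python
from collections import Counter
from math import ceil, log2


def rank(seq, x, i):
    """Occurrences of x among the first i elements of seq."""
    return sum(1 for y in seq[:i] if y == x)


def select(seq, x, r):
    """1-based position of the r-th occurrence of x in seq."""
    for pos, y in enumerate(seq, start=1):
        if y == x:
            r -= 1
            if r == 0:
                return pos
    raise ValueError("fewer than r occurrences")


class PartitionedString:
    def __init__(self, S):
        n, freq = len(S), Counter(S)
        self.M = {a: ceil(log2(n / f)) for a, f in freq.items()}  # symbol -> level
        self.N, self.M_loc = {}, {}        # symbol -> local id; level -> symbols
        for a in sorted(freq):
            names = self.M_loc.setdefault(self.M[a], [])
            names.append(a)
            self.N[a] = len(names)         # local ids are 1-based
        self.S_lev = [self.M[a] for a in S]
        self.S_l = {l: [self.N[a] for a in S if self.M[a] == l] for l in self.M_loc}

    def access(self, i):
        l = self.S_lev[i - 1]
        local = self.S_l[l][rank(self.S_lev, l, i) - 1]
        return self.M_loc[l][local - 1]

    def rank(self, a, i):
        l = self.M[a]
        return rank(self.S_l[l], self.N[a], rank(self.S_lev, l, i))

    def select(self, a, r):
        l = self.M[a]
        return select(self.S_lev, l, select(self.S_l[l], self.N[a], r))

    def insert(self, a, i):
        """Insert an already known symbol a so that it becomes S[i]."""
        l = self.M[a]
        self.S_lev.insert(i - 1, l)
        # After the first insertion, rank counts the new level entry itself.
        self.S_l[l].insert(rank(self.S_lev, l, i) - 1, self.N[a])
```

Each query on S becomes one operation on the level sequence plus one on a single per-level string, which is the source of the improved bounds discussed next.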

If the symbol $a$ we are inserting did not exist in $S$, it will be assigned the last level $\ell = \lceil \lg n \rceil + 1$ and will not appear in $M^{\ell}$. In this case we add $a$ at the end of $M^{\ell}$, $M^{\ell}[\sigma_{\ell} + 1] = a$, increase $\sigma_{\ell}$, set $N[a] = \sigma_{\ell}$ and $M[a] = \ell$. Then we proceed with the insertion. Instead, if a deletion removes the last occurrence of $a$, we handle it as a part of the more global update mechanism we explain next.

We keep track of the current frequency in $S$ of each symbol $a \in \Sigma$, $n_a$, and the frequency that $a$
had when it was assigned its current level, $n'_a$. We retain the level $\ell$ assigned to a symbol $a$ as long as
$n'_a/2 < n_a < 2n'_a$. When $n_a = 2n'_a$ or $n_a = \lfloor n'_a/2 \rfloor$, we move $a$ to a new level
$\ell' = \lceil \lg(n/n_a) \rceil = \ell \pm 1$, as follows. We compute the mapping $a' = N[a]$ of $a$ in $S^\ell$, change
$M[a]$ to $\ell'$, and compute the new mapping $a'' = \sigma_{\ell'} + 1$ of $a$ in $S^{\ell'}$. Now, for each of the $n_a$
occurrences of $a'$ in $S^\ell$, say $S^\ell[i] = a'$ (found using $i = \mathrm{select}_{a'}(S^\ell, 1)$), we compute its position
$j = \mathrm{select}_\ell(S^{lev}, i)$ in $S^{lev}$, change $S^{lev}[j]$ to $\ell'$, remove symbol $S^\ell[i]$, and insert symbol $a''$ in
$S^{\ell'}$ at position $\mathrm{rank}_{\ell'}(S^{lev}, j)$. We also update the mappings: we set $M^{\ell'}[a''] = a$ and $N[a] = a''$,
and move the last element of $M^\ell$ to occupy the empty slot left by $a$: $M^\ell[a'] = M^\ell[\sigma_\ell]$ and
$N[b] = a'$, where $b = M^\ell[\sigma_\ell]$. We find all the occurrences of $\sigma_\ell$ in $S^\ell$ and replace them by $a'$.
Finally, we increase $\sigma_{\ell'}$ and decrease $\sigma_\ell$. Of course, when the target of $a$ is the level
$\ell' = \lceil \lg n \rceil + 1$, we simply delete it instead of moving it to $S^{\ell'}$.
Likewise we also remember the total number of symbols $n'$ in the sequence at the time when the
data structure was created. When $n = 2n'$ or $n = n'/2$, we rebuild the data structure.
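
The triggering rule can be sketched as follows. This is a minimal illustration under our own naming (`ceil_lg_ratio`, `must_move`); it only decides when a symbol changes level and which level it targets, and does not perform the re-encoding itself.

```python
def ceil_lg_ratio(n, n_a):
    """ceil(lg(n / n_a)) for 1 <= n_a <= n, computed with integers only."""
    level = 0
    while (n_a << level) < n:   # smallest level with n_a * 2**level >= n
        level += 1
    return level


def must_move(n_a, n_a_ref):
    """True when the frequency n_a has doubled or halved with respect to n_a_ref,

    where n_a_ref is the frequency recorded when the symbol got its current level.
    """
    return n_a >= 2 * n_a_ref or n_a <= n_a_ref // 2
```

For instance, with $n = 16$ a symbol recorded at frequency 4 stays at its level while its frequency is 3, 4, ..., 7, and is moved once the frequency reaches 8 or drops to 2.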

The number of insertions or deletions that must occur until we change the level of $a$ is
$\Theta(n'_a) = \Theta(n_a)$. Therefore, the process of changing a symbol from one level to another, which costs $O(n_a)$
update operations on $S^{lev}$, $S^\ell$, $M^\ell$, $M$ and $N$, is amortized over $\Theta(n_a)$ updates. The same occurs
with the symbol $b$ mapped to $\sigma_\ell$ in $S^\ell$, whose occurrences have to be re-encoded as $a'$: Since
$\lceil \lg(n/n'_b) \rceil = \lceil \lg(n/n'_a) \rceil$, it follows that $n_b = \Theta(n_a)$.

Note that we are letting the alphabet of the sequences $S^\ell$ grow and shrink, which our wavelet
trees do not support. Rather, we create them with the maximum possible alphabet size. Since
$\ell = \lceil \lg(n/n'_a) \rceil$, the smallest frequency $n_a$ that can belong to level $\ell$ satisfies $\ell = \lg(n/n'_a)$ and
$n_a = n'_a/2$. The total length of the sequence, $n$, can also grow by a factor of almost 2; therefore
there can be at most $n/n_a = 2^{\ell+2}$ distinct symbols at level $\ell$.

Time. The queries on $S^{lev}$ cost time $O(\lg n / \lg\lg n)$, because its alphabet is of size $O(\lg n)$.
Queries on $S^\ell$ cost time $O((\lg 2^\ell + \lg n)/\lg\lg n) = O(\lg n/\lg\lg n)$, since $\ell = O(\lg n)$. The accesses
to $M$, $N$, and $M^\ell$ are constant-time. Therefore, we reach the optimal worst-case time $O(\lg n / \lg\lg n)$
for the three queries. Updates, similarly, cost $O(\lg n/\lg\lg n)$ amortized time.

Space. Each symbol $a$ with frequency $n_a$ will be stored at a level $\ell = \lceil \lg(n/n_a) \rceil \pm 2$, which is a
sequence over an alphabet of size $2^{\ell+O(1)}$. Therefore, calling $n_\ell = |S^\ell|$, we will spend $n_a \lg(n_\ell/n_a) +
O(n_a \lg(2^{\ell+2})/\lg^{1-\varepsilon} n)$ bits for it, according to Theorem 2. This is $n_a \lg(n_\ell/n_a) + o(n_a \lg(n/n_a))$
bits, which added over the whole $S^\ell$ yields $\sum n_a \lg(n_\ell/n_a) + o(\sum n_a \lg(n/n_a))$ bits. Now consider
the occurrences of symbol $\ell$ in $S^{lev}$, which we will also charge to $S^\ell$. These cost $n_\ell \lg(n/n_\ell) +
O(n_\ell \lg\lg n/\lg^{1-\varepsilon} n) = n_\ell \lg(n/n_\ell) + o(n_\ell)$. Added to the space spent at $S^\ell$ itself, and since the
sum of the $n_a$'s is $n_\ell$, we obtain $\sum n_a \lg(n/n_a) + o(\sum n_a (1 + \lg(n/n_a)))$ bits. Now, adding over the
symbols $a$ of all the levels, we obtain the total space $nH_0(S) + o(n(1 + H_0(S)))$. Theorem 2 also
involves a cost of $O(2^{\ell+2} \lg^{1+\varepsilon} n)$ bits per level $\ell$, which seem to add up to $O(n \lg^{1+\varepsilon} n)$. However,
this blowup is artificial since it comes from storing at most one almost-empty block at each wavelet
tree node, and there cannot be more than $\sigma$ nonempty wavelet tree nodes across all the strings $S^\ell$.
Thus the extra space is only $O(\sigma \lg^{1+\varepsilon} n)$.

In addition we spend $O(\sigma(\lg n + \lg \sigma))$ bits for the arrays $M$, $N$ and $M^\ell$. Finally, recall that
we also spend space in storing deleted symbols, but these are at most $O(n/\lg^2 n)$, and thus they
cannot increase the entropy by more than $O(n/\lg n)$. We have obtained the final result.

Theorem 3 A dynamic string $S[1,n]$ over alphabet $[1..\sigma]$ can be stored in a structure using $nH_0(S) +
O(n(1 + H_0(S))/\lg^{1-\varepsilon} n + \sigma(\lg \sigma + \lg^{1+\varepsilon} n))$ bits, for any constant $0 < \varepsilon < 1$, and supporting queries
access, rank and select in optimal time $O(\lg n/\lg\lg n)$. Insertions and deletions of symbols are
supported in $O(\lg n/\lg\lg n)$ amortized time.

5.3 Handling General Alphabets 

Our time results do not depend on the alphabet size $\sigma$, yet our space does, in a way that ensures
that $\sigma$ gives no problems as long as $\sigma = o(n/\lg^{1+\varepsilon} n)$ for some constant $\varepsilon > 0$.

Let us now consider the case where the alphabet $\Sigma$ is much larger than the effective alphabet
of the string, that is, the set of symbols that actually appear in $S$ at a given point in time. Let
us now use $\sigma \le n$ to denote the effective alphabet size. Our aim is to maintain the space within
$nH_0(S) + O(n(1 + H_0(S))/\lg^{1-\varepsilon} n + \sigma \lg^{1+\varepsilon} n)$ bits, even when the symbols come from a large
universe $\Sigma = [1..|\Sigma|]$, or even from an unbounded ordered universe such as $\Sigma = \mathbb{R}$ or $\Sigma = \Gamma^*$ (i.e.,
the elements of $\Sigma$ are strings over another alphabet $\Gamma$).

Our arrangement into strings $S^\ell$ gives a simple way to handle a sequence over an unbounded
ordered alphabet. By changing these tables to binary search trees and maintaining the alphabets
of strings $S^\ell$ at size $O(\min(\sigma, 2^\ell))$, we retain the space in terms of $\sigma$, and the times rise to
$O(\tau \lg \sigma + \lg n/\lg\lg n)$, where $\tau$ is the time of a single comparison between elements of $\Sigma$ (e.g.,
$\tau = O(1)$ for $\Sigma = \mathbb{R}$ in the comparison model; it is also possible to obtain $O(|w|)$ instead of
$O(\tau \lg \sigma)$ on strings $w$ by using tries). This time is optimal if we can only make binary comparisons
on $\Sigma$. The space redundancy of our data structure increases only by what is needed to represent
$\sigma$ elements from $\Sigma$. If, in particular, $\Sigma$ is an integer range $[1..|\Sigma|]$, the time can be reduced to
$O(\lg\lg|\Sigma| + \lg n/\lg\lg n)$ and the space increases by $O(\sigma \lg|\Sigma|)$ bits, by using y-fast tries [32].

6 Applications



Our new results have an impact on a number of applications that build on dynamic sequences. For lack of
space we describe the most obvious one here and defer the rest to Appendix B.

Dynamic sequence collections. The standard application of dynamic sequences, stressed
in several previous papers [11, 22, 16, 25], is to maintain a collection $C$ of texts, where one can carry
out indexed pattern matching, as well as inserting and deleting texts from the collection. With our
new representation we can improve the time of previous work (yet our update time is amortized).

Theorem 4 There exists a data structure for handling a collection $C$ of texts over an alphabet $[1, \sigma]$
within size $nH_h(C) + o(n \lg \sigma) + O(\sigma \lg^{1+\varepsilon} n + \sigma^{h+1} \lg n + m \lg n + w)$ bits, for any constant $\varepsilon > 0$ and
simultaneously for all $h$. Here $n$ is the length of the concatenation of $m$ texts, $C = T_1 \circ T_2 \cdots T_m$,
and we assume that $\sigma = o(n)$ is the alphabet size and $w = \Omega(\lg n)$ is the machine word size under the
RAM model. The structure supports counting of the occurrences of a pattern $P$ in $O(|P| \lg n/\lg\lg n)$
time, and inserting and deleting a text $T$ in $O(\lg n + |T| \lg n/\lg\lg n)$ amortized time. After counting,
any occurrence can be located in time $O(\lg n \lg_\sigma n)$. Any substring of length $\ell$ from any $T$ in the
collection can be displayed in time $O(\lg n (\lg_\sigma n + \ell/\lg\lg n))$. For $0 \le h \le (\alpha \lg_\sigma n) - 1$, for any
constant $0 < \alpha < 1$, the space complexity simplifies to $nH_h(C) + o(n \lg \sigma) + O(m \lg n + w)$ bits.

The theorem refers to $H_h(C)$, the $h$-th order empirical entropy of sequence $C$ [23]. This is a
lower bound to any semistatic statistical compressor that encodes each symbol as a function of the
$h$ preceding symbols in the sequence, and it holds $H_h(C) \le H_{h-1}(C) \le H_0(C) \le \lg \sigma$ for any $h > 0$.
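
To make the quantity concrete, the sketch below computes the empirical entropies of a short string. It assumes the common convention in which $nH_h$ is the sum, over all length-$h$ contexts, of the zero-order entropy of the symbols that follow each context, and it ignores the first $h$ symbols, which have no full context; the function names `h0` and `hk` are ours.

```python
from collections import Counter, defaultdict
from math import log2


def h0(seq):
    """Zero-order empirical entropy of seq, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(seq).values())


def hk(seq, h):
    """h-th order empirical entropy of the string seq, in bits per symbol."""
    if h == 0:
        return h0(seq)
    followers = defaultdict(list)          # context -> symbols that follow it
    for i in range(h, len(seq)):
        followers[seq[i - h:i]].append(seq[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(seq)
```

On "abracadabra" the values decrease as $h$ grows, in line with the chain of inequalities above.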

To obtain $H_h$ from $H_0$ we use Theorem 2, which does not use alphabet partitioning (its times
become $O(\lg n/\lg\lg n)$ since we assume $\sigma = o(n)$). In that version of our structure the sequence
is encoded using tuples, and the space is the sum of the local entropy of the miniblocks. To offer
search capabilities, the Burrows-Wheeler Transform (BWT) [10] of $C$, $C^{bwt}$, is represented, not $C$. It
has been proved [22] that if one splits the sequences stored at the levels of the wavelet tree of $C^{bwt}$
into chunks representing $\Theta(\lg n)$ bits, then the sum of the zero-order entropies of the chunks adds
up to $nH_h(C) + O(\sigma^{h+1} \lg n)$ bits, and this stays true for the variable-size chunks we use [25]. Finally,
the locating and displaying overheads are obtained by marking one element out of $\lg_\sigma n \lg\lg n$.

7 Conclusions and Further Challenges 

We have obtained $O(\lg n/\lg\lg n)$ time for all the operations that handle a dynamic sequence on
an arbitrary (known) alphabet $[1..\sigma]$, matching lower bounds that apply to binary alphabets [14].
Our structure is faster than previous work [18, 25] by a factor of $\Theta(\lg \sigma/\lg\lg n)$. It also compresses
the redundancy space, using $nH_0(S) + o(n(1 + H_0(S))) + O(\sigma(\lg \sigma + \lg^{1+\varepsilon} n))$ bits, instead of
$nH_0(S) + o(n \lg \sigma) + O(\sigma(\lg \sigma + \lg n))$ of previous work. We can also handle unbounded alphabets.
Our result can be applied to a number of problems; we have described several.

The only remaining advantage of previous work [18, 25] is that their update times are worst-case,
whereas in our structure they are amortized. Obtaining optimal worst-case time complexity
for updates is an interesting future challenge.

Another challenge is to simulate operations other than access, rank and select. Obtaining the
full functionality of wavelet trees with better time than the current dynamic ones [18, 25] is unlikely,
as then we could probably solve range counting on an $n \times n$ grid [8] in less than the dynamic lower
bound, $\Omega((\lg n/\lg\lg n)^2)$ [28]. Yet, there may be some intermediate functionality of interest.

A Data Structures for Handling Blocks



We describe the way the data is stored in blocks $G_i(u)$, as well as the way the various data structures
inside blocks operate. All data structures are based on the same idea: We maintain a tree with
node degree $\lg^\delta n$ and leaves that contain $o(\lg n)$ elements. Since elements within a block can
be addressed with $O(\lg\lg n)$ bits, each internal node and each leaf fits into one machine word.
Moreover, we can support searching and basic operations in each node in constant time.

Data organization. The block data is physically stored as a sequence of miniblocks of $\Theta(\lg_\rho n)$
symbols. Thus there are $O(\lg^2 n \lg \rho) = O(\lg^2 n \lg\lg n)$ miniblocks in a block. These miniblocks
will be the leaves of an $r$-ary tree $T$, for $r = \Theta(\lg^\delta n)$ and some constant $0 < \delta < 1$. The height
of this tree is constant, $O(1/\delta)$. Each node of $T$ stores $r$ counters telling the number of symbols
stored at the leaves that descend from each child. This requires just $O(r \lg\lg n) = o(\lg n)$ bits. To
access any position of $G_i(u)$, we descend in $T$, using the counters to determine the correct child.
When we arrive at the leaf, we know the local offset of the desired symbol within the leaf, and can
access it directly. Since the counters fit in less than a machine word, a small universal table gives
the correct child in constant time, therefore we have $O(1)$ access to any symbol (actually to any
$\Theta(\lg_\rho n)$ consecutive symbols).

Upon insertions or deletions, we arrive at the correct leaf, insert or delete the symbol (in
constant time because the leaf contains $O(\lg n)$ bits overall), and update the counters in the path
from the root (in constant time as they have $o(\lg n)$ bits). The leaves may have $\lg n$ to $2 \lg n$ bits.
Splits/merges upon overflows/underflows are handled as usual, and can be solved in a constant
number of operations requiring $O(1)$ time each ($T$ operates as a B-tree; internal nodes may have $r$
to $2r$ children).
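
The counter-guided descent can be pictured with the following Python sketch. It is a deliberate simplification: a single root with one counter per leaf replaces the constant-height tree, a linear scan over the counters replaces the universal-table lookup, leaves are only split on overflow and dropped when empty (no merging), and `max_leaf` is a made-up capacity standing in for the $2 \lg n$ bits of a miniblock.

```python
class BlockTree:
    """One-level sketch of tree T: a root with per-child counters over miniblocks."""

    def __init__(self, max_leaf=4):
        self.max_leaf = max_leaf   # hypothetical leaf capacity, in symbols
        self.leaves = [[]]         # miniblocks, left to right
        self.counts = [0]          # counts[k] = number of symbols in leaves[k]

    def _locate(self, i):
        """Return (leaf index, offset inside that leaf) for block position i."""
        last = len(self.counts) - 1
        for k, c in enumerate(self.counts):
            if i < c or k == last:
                return k, i
            i -= c

    def access(self, i):
        k, off = self._locate(i)
        return self.leaves[k][off]

    def insert(self, i, sym):
        k, off = self._locate(i)
        self.leaves[k].insert(off, sym)
        self.counts[k] += 1
        if self.counts[k] > self.max_leaf:      # overflow: split the leaf in two
            leaf = self.leaves[k]
            half = len(leaf) // 2
            self.leaves[k:k + 1] = [leaf[:half], leaf[half:]]
            self.counts[k:k + 1] = [half, len(leaf) - half]

    def delete(self, i):
        k, off = self._locate(i)
        del self.leaves[k][off]
        self.counts[k] -= 1
        if self.counts[k] == 0 and len(self.leaves) > 1:   # drop an emptied leaf
            del self.leaves[k]
            del self.counts[k]
```

Only the counters on the path to the affected leaf change on an update, which is why the real structure can afford to keep them packed in a machine word.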

The space overhead due to the nodes of T is 0(\Gi(u)\ lg s n lglg n/ lg n) bits. We consider now 
the space used by the data itself. 

In order not to waste space, the miniblock leaves are stored using a memory management
technique by Munro [24]. For our case, it allows us to allocate, free, and access miniblocks of
length $\lg n$ to $2 \lg n$ in constant time. Its space waste, given that our pointers are of $O(\lg\lg n)$
bits, is $O(\lg\lg n/\lg n)$ per allocated miniblock, plus a global redundancy of $O(\lg^2 n)$ bits.¹ We use
one structure per block, handling its miniblocks, so the global redundancy adds just $O(n/\lg n)$
bits overall. Each structure uses a memory area of fixed-size cells (inside which the variable-length
miniblocks are stored) that grows or shrinks at the end as miniblocks are created or destroyed. A
structure giving that functionality is called an extendible array (EA) [30]. We need to handle a
set of $O(n/\lg^3 n)$ EAs, which is called a collection of extendible arrays. Its functionality includes
accessing any cell of any EA, letting it grow or shrink by one cell, and creating and destroying EAs. The
following lemma, simplified from the original [30, Lemma 1], and using words of $\lg n$ bits, is useful.

Lemma 1 A collection of $a$ EAs of total size $s$ bits can be represented using $s + O(a \lg n + \sqrt{s a \lg n})$
bits of space, so that the operations of creation of an empty EA and access take constant worst-case
time, whereas grow/shrink take constant amortized time. An EA of $s'$ bits can be destroyed in time
$O(s'/\lg n)$.

¹ When we store the miniblocks in compressed form, in Section 5.1, their physical size could be as small as
$O(\lg^\varepsilon n \lg\lg n)$. Such small sizes can also be handled, and the overhead with respect to the amount of logical data
represented is maintained.

In our case $a = O(n/\lg^3 n)$ and $s = O(n \lg \sigma)$, so the extra space used is $O(n/\lg^2 n +
n\sqrt{\lg \sigma}/\lg n) = o(n \lg \sigma/\lg n)$.
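
For the reader's convenience, the two terms can be checked by direct substitution. The derivation below assumes that the second redundancy term of Lemma 1 is $\sqrt{s\,a \lg n}$, which is what the stated result $n\sqrt{\lg\sigma}/\lg n$ suggests.

```latex
\begin{align*}
a \lg n &= O\!\left(\frac{n}{\lg^3 n}\right)\lg n
         = O\!\left(\frac{n}{\lg^2 n}\right), \\
\sqrt{s\,a\,\lg n} &= O\!\left(\sqrt{\,n \lg\sigma \cdot \frac{n}{\lg^3 n} \cdot \lg n\,}\right)
         = O\!\left(\sqrt{\frac{n^2 \lg\sigma}{\lg^2 n}}\right)
         = O\!\left(\frac{n\sqrt{\lg\sigma}}{\lg n}\right).
\end{align*}
```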

Structure $R_i(u)$. To support rank and select we enrich $T$ with further information per node. We
store $\rho$ counters with the number of occurrences of each symbol in the subtree of each child. The
node size becomes $O(r\rho \lg\lg n) = O(\lg^{\varepsilon+\delta} n \lg\lg n) = o(\lg n)$ as long as $\varepsilon + \delta < 1$. This dominates
the total space overhead, which becomes $O(n \lg \sigma \lg\lg n / \lg^{1-\varepsilon-\delta} n)$. By setting $\delta$ and $\varepsilon$ to slightly
more than the original $\varepsilon$ value, this can be rewritten as $O(n \lg \sigma/\lg^{1-\varepsilon} n)$.

With this information on the nodes we can easily solve rank and select in constant time, by
descending on $T$ and determining the correct child (and accumulating data on the leftward children)
in $O(1)$ time using universal tables. Nodes can also be updated in constant time even upon splits
and merges, since all the counters can be recomputed in $O(1)$ time.
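
A flat, one-level version of this idea is sketched below: each leaf keeps a table of symbol counts, rank adds the counters of the leaves entirely to the left and scans only the target leaf, and select skips whole leaves using the same counters. Linear scans replace the universal tables, and the function names are ours.

```python
from collections import Counter


def build(leaves):
    """Per-leaf symbol counters, the analogue of the per-child counters of a node."""
    return [Counter(leaf) for leaf in leaves]


def rank(leaves, counters, c, i):
    """Occurrences of c among the first i symbols of the concatenated leaves."""
    total = 0
    for leaf, cnt in zip(leaves, counters):
        if i >= len(leaf):        # the whole leaf lies to the left: use its counter
            total += cnt[c]
            i -= len(leaf)
        else:                     # target leaf: scan only the local prefix
            return total + leaf[:i].count(c)
    return total


def select(leaves, counters, c, k):
    """0-based position of the k-th (k >= 1) occurrence of c, or -1 if there is none."""
    base = 0
    for leaf, cnt in zip(leaves, counters):
        if k > cnt[c]:            # skip the leaf using its counter
            k -= cnt[c]
            base += len(leaf)
        else:
            for off, x in enumerate(leaf):
                if x == c:
                    k -= 1
                    if k == 0:
                        return base + off
    return -1
```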

Structure $F_i(u)$. This structure stores all the intra-node pointers leaving from block $G_i(u)$, to
its parent and to any of the $\rho$ children of the wavelet tree node.

The structure is a tree $T_f$ very similar in structure to $T$. The pointers stored are intra-node,
and thus require $O(\lg n)$ bits. Thus we store a constant number of pointers per leaf. For each
pointer we store the position in $G_i(u)$ holding the pointer (relative to the starting position of the
leaf node) and the target position. The internal nodes, of arity $r$, maintain information on the
number of positions of $G_i(u)$ covered by each child, and the number of pointers of each kind ($1 + \rho$
counters) stored in the subtree of each child. This requires $O(r\rho \lg\lg n) = o(\lg n)$ bits, as before.
To find the last position before $j$ holding a pointer of a certain kind (parent or $t$-th wavelet tree
child, for any $1 \le t \le \rho$), we traverse $T_f$ from the root looking for position $j$. At each node $v$, it
might be that the child $u$ where we have to enter holds pointers of that kind, or not. If it does,
then we first enter into child $u$. If we return with an answer, we recursively return it. If we return
with no answer, or there are no pointers of the desired kind below $u$, we enter into the last sibling
to the left of $u$ that holds a pointer of the desired kind, and switch to a different mode where we
simply go down the tree looking for the rightmost child with a pointer of the desired kind. It is
not hard to see that this procedure visits $O(1/\delta)$ nodes, and thus it is constant-time because all
the computations inside nodes can be done in $O(1)$ time with universal tables. When we arrive at
the leaf, there may be at most two pointers associated to the desired position (one to the parent
and another to a wavelet tree child), so we can scan for the desired pointer in constant time.
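
The two-mode search can be sketched as follows for a single kind of pointer. This is an illustrative simplification under our own names (`Node`, `rightmost`, `last_before`): each leaf is a list of booleans marking which positions hold a pointer, each node caches the number of positions and of marked positions below it, and plain loops replace the universal tables.

```python
class Node:
    """Tree node: either a leaf with marks, or an internal node with children."""

    def __init__(self, children=None, marks=None):
        self.children = children or []   # internal node: list of Node
        self.marks = marks or []         # leaf: marks[p] is True if p holds a pointer
        if self.children:
            self.size = sum(c.size for c in self.children)
            self.count = sum(c.count for c in self.children)
        else:
            self.size = len(self.marks)
            self.count = sum(self.marks)


def rightmost(node):
    """Rightmost marked position below node (requires node.count > 0)."""
    if not node.children:
        return max(p for p, m in enumerate(node.marks) if m)
    offset = node.size
    for child in reversed(node.children):
        offset -= child.size
        if child.count:
            return offset + rightmost(child)


def last_before(node, j):
    """Largest marked position p < j below node, or None if there is none."""
    if not node.children:
        hits = [p for p in range(min(j, node.size)) if node.marks[p]]
        return hits[-1] if hits else None
    offset, marked_left = 0, None        # last sibling to the left holding a mark
    for child in node.children:
        if j <= offset + child.size or child is node.children[-1]:
            if child.count:              # first try the child that covers j - 1
                p = last_before(child, j - offset)
                if p is not None:
                    return offset + p
            if marked_left is None:
                return None
            sibling, sibling_offset = marked_left
            return sibling_offset + rightmost(sibling)   # switch to "rightmost" mode
        if child.count:
            marked_left = (child, offset)
        offset += child.size
    return None
```

At each level the search enters at most the child covering the query and one sibling in rightmost mode, which is the reason the number of visited nodes stays proportional to the height.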

The tree $T_f$ is updated upon insertions and deletions of pointers just like $T$. It must also be
updated upon insertions and deletions of symbols, even if they do not have pointers. We traverse
$T_f$ looking for the position of the update, change the offsets stored at the leaf, and update the
subtree sizes stored at the nodes.

Structure $H_i(u)$. This structure manages the inter-node pointers that point inside $G_i(u)$. As
explained in the paper, we give the outside nodes handles that do not change over time, and
$H_i(u)$ translates handles to positions in $G_i(u)$.

We store a tree $T_h$ that is just like $T_f$, where the incoming pointers are stored. $T_h$ is simpler,
however, because at each node we only need to store the number of positions covered by the subtree
of each child. Also, it is possible to traverse $T_h$ from a leaf to the root. We also manage a table
Tbl so that Tbl[$h$] points to the leaf where the pointer corresponding to handle $h$ is stored. At the
leaves we store, for each pointer, a backpointer to Tbl and the position in $G_i(u)$ (in relative form).
Given a handle $h$, we go to the leaf, find in constant time the one pointing back to $h$, and move
upwards up to the root, adding to the position the number of positions covered by leftward children
of each node. At the end we have obtained the position in constant time.
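
The following sketch illustrates how a handle stays valid while the position it denotes moves. It flattens the tree to a single list of leaves, so "moving upwards" becomes adding the sizes of the leaves to the left; the class `HandleTable` and its methods are hypothetical names, and leaf splits and merges are left out.

```python
class HandleTable:
    """Sketch of handles that resolve to moving positions inside a block."""

    def __init__(self, leaf_sizes):
        # Each leaf covers a run of positions and stores {handle: relative offset}.
        self.leaves = [{"size": s, "ptrs": {}} for s in leaf_sizes]
        self.tbl = {}             # Tbl[h] -> leaf holding the pointer of handle h
        self.next_handle = 0

    def _leaf_of(self, pos):
        """Return (leaf, offset inside the leaf) for block position pos."""
        for leaf in self.leaves:
            if pos < leaf["size"] or leaf is self.leaves[-1]:
                return leaf, pos
            pos -= leaf["size"]

    def create(self, pos):
        """Register an incoming pointer to position pos and return its handle."""
        leaf, off = self._leaf_of(pos)
        h = self.next_handle
        self.next_handle += 1
        leaf["ptrs"][h] = off     # stored relative to the start of the leaf
        self.tbl[h] = leaf
        return h

    def resolve(self, h):
        """Translate handle h into the current position in the block."""
        target = self.tbl[h]
        pos = target["ptrs"][h]
        for leaf in self.leaves:  # add the sizes of the leaves to the left
            if leaf is target:
                return pos
            pos += leaf["size"]

    def symbol_inserted(self, pos):
        """A symbol was inserted at pos: only one leaf needs to be touched."""
        leaf, off = self._leaf_of(pos)
        leaf["size"] += 1
        for h, o in list(leaf["ptrs"].items()):
            if o >= off:
                leaf["ptrs"][h] = o + 1
```

Pointers stored in other leaves are never rewritten on an insertion; they shift implicitly because the size of a leaf to their left changed.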

When pointers to $G_i(u)$ are created or destroyed, we insert or remove pointers in $T_h$. We
maintain a list of empty cells in Tbl for future handles. We must also update $T_h$ upon symbol
insertions and deletions in $G_i(u)$, to maintain the positions up to date. When a leaf splits or merges,
we update the pointers from a constant number of positions in Tbl, found with the backpointers.
Since Tbl contains $O(\lg^3 n)$ pointers of $O(\lg\lg n)$ bits, it poses an overhead of $O(\lg \sigma \lg\lg n/\lg n)$
per symbol, so we can allocate it for the maximum possible block size.

Structure $D_i(u)$. This is a simple tree $T_d$ similar to $T$, storing at each node the number of
positions and the number of undeleted positions below each child. It should be obvious at this
point how to operate it.

B More Applications 

Burrows-Wheeler Transform in compressed space. Another application of dynamic sequences
is to build the BWT of a text $T$, $T^{bwt}$, within compressed space, by starting from an
empty sequence and inserting each new character, $T[n], T[n-1], \ldots, T[1]$, at the proper positions.
The result is stated as the compressed construction of a static FM-index [13], a compressed index
that consists essentially of a (static) wavelet tree of $T^{bwt}$. Our new representation improves upon
the best previous result on compressed space [25].
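
The backward construction can be sketched in a few lines of Python. This is the textbook insertion procedure, not necessarily the exact variant used in the cited works: a Python list with linear-time counting stands in for the dynamic sequence, and a sentinel `$`, assumed smaller than every text character, marks the end of the text.

```python
def bwt_by_insertion(text, sentinel="$"):
    """BWT of text + sentinel, built by inserting T[n], T[n-1], ..., T[1] in turn."""
    bwt = [sentinel]   # BWT of the empty text followed by the sentinel
    p = 0              # current position of the sentinel inside bwt
    for c in reversed(text):
        smaller = sum(1 for x in bwt if x < c)   # symbols smaller than c (sentinel included)
        q = smaller + bwt[:p].count(c)           # row of the new, longest suffix
        bwt[p] = c                               # the old longest suffix is now preceded by c
        bwt.insert(q, sentinel)                  # the new longest suffix is preceded by the sentinel
        p = q
    return "".join(bwt)
```

For example, `bwt_by_insertion("banana")` yields `annb$aa`. Each step uses one rank and one insertion on the sequence, so with the dynamic representation the whole construction costs $n$ times the update time.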




Theorem 5 The Alphabet-Friendly FM-index [13], as well as the BWT [10], of a text $T[1,n]$ over
an alphabet of size $\sigma$, can be built using $nH_h(T) + o(n \lg \sigma)$ bits, simultaneously for all $1 \le h \le
(\alpha \lg_\sigma n) - 1$ and any constant $0 < \alpha < 1$, in time $O(n \lg n/\lg\lg n)$. It can also be built within the
same time and $nH_0(T) + o(n(1 + H_0(T))) + O(\sigma(\lg \sigma + \lg^{1+\varepsilon} n))$ bits, for any constant $\varepsilon > 0$ and
any alphabet size $\sigma$.

We are using Theorem 2 for the case $h > 0$, and Theorem 3 to obtain a less alphabet-restrictive
result for $h = 0$. Note we obtain $o(n \lg n)$ time within compressed space. Other space-conscious
results that achieve better time complexity (but more space) than our result are Okanohara and
Sadakane [27], who achieved optimal $O(n)$ time within $O(n \lg \sigma \lg\lg_\sigma n)$ bits of space, and Hon et
al. [20], who achieved $O(n \lg\lg \sigma)$ time and $O(n \lg \sigma)$ bits of space.

Binary relations. Barbay et al. [4] show how to represent a binary relation of $t$ pairs relating $n$
"objects" with $\sigma$ "labels" by means of a string of $t$ symbols over alphabet $[1..\sigma]$ plus a bitmap of
length $t + n$. The idea is to traverse the matrix, say, object-wise, and write down in a string the
labels of the pairs found. Meanwhile we append a 1 to the bitmap each time we find a pair and a
0 each time we move to the next object. Then queries like: find the objects related to a label, find
the labels related to an object, and tell whether an object and a label are related, are answered via
access, rank and select operations on the string and the bitmap.
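
A small static Python sketch of this encoding follows. It assumes objects are numbered from 0, appends a 0 to the bitmap after the pairs of each object, and uses naive linear-time `rank` and `select` helpers; all function names are ours.

```python
def encode(pairs, n_objects):
    """Return (labels, bitmap) for a set of (object, label) pairs."""
    by_obj = [[] for _ in range(n_objects)]
    for obj, lab in sorted(pairs):
        by_obj[obj].append(lab)
    labels, bitmap = [], []
    for row in by_obj:
        labels.extend(row)
        bitmap.extend([1] * len(row))   # one 1 per pair found
        bitmap.append(0)                # a 0 when moving to the next object
    return labels, bitmap


def rank(seq, c, i):
    """Occurrences of c in seq[0:i]."""
    return seq[:i].count(c)


def select(seq, c, k):
    """0-based position of the k-th (k >= 1) occurrence of c in seq."""
    return [p for p, x in enumerate(seq) if x == c][k - 1]


def _label_range(bitmap, obj):
    """Half-open range of string positions that hold the labels of obj."""
    start = 0 if obj == 0 else select(bitmap, 0, obj) + 1
    end = select(bitmap, 0, obj + 1)
    return rank(bitmap, 1, start), rank(bitmap, 1, end)


def labels_of(labels, bitmap, obj):
    lo, hi = _label_range(bitmap, obj)
    return labels[lo:hi]


def objects_of(labels, bitmap, lab):
    result = []
    for k in range(1, rank(labels, lab, len(labels)) + 1):
        p = select(labels, lab, k)        # k-th pair carrying this label
        b = select(bitmap, 1, p + 1)      # the 1 of that pair in the bitmap
        result.append(rank(bitmap, 0, b)) # objects already closed before it
    return result


def related(labels, bitmap, obj, lab):
    lo, hi = _label_range(bitmap, obj)
    return rank(labels, lab, hi) - rank(labels, lab, lo) > 0
```

Every query reduces to a constant number of rank and select calls per delivered datum, which is what makes the dynamic version inherit the times of the underlying sequence and bitmap.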

A limitation in the past to make this representation dynamic was that creating or removing
labels implied changing the alphabet of the string. Now we can use Theorem 3 and the results of
Section 5.3 to obtain a fully dynamic representation. Note our structure is so efficient that the
bitmap times dominate the complexity.

Theorem 6 A dynamic binary relation consisting of $t$ pairs relating $n$ objects with $\sigma$ labels can
support the operations of counting and listing the objects related to a given label, counting and listing
the labels related to a given object, and telling whether an object and a label are related, all in time
$O(\lg(n + t)/\lg\lg(n + t))$ per delivered datum. Pairs can also be added and deleted in amortized time
$O(\lg(n + t)/\lg\lg(n + t))$. The space required is $tH + n + t + o(n + t + tH) + O(\sigma(\lg \sigma + \lg^{1+\varepsilon} t))$, where
$\varepsilon$ is any constant and $H = \sum_{1 \le i \le \sigma} (t_i/t) \lg(t/t_i) \le \lg \sigma$, where $t_i$ is the number of objects related
to label $i$. It is also possible to handle insertions and deletions of labels and objects (deletions need
that no pair involves them), coming from a universe $[1..N]$, within $O(\lg\lg N)$ additional time per
operation and $O((n + \sigma) \lg N)$ extra bits of space.

Directed graphs. A particularly interesting and general binary relation is a directed graph with 
n nodes and e edges. Our binary relation representation allows one to navigate it in forward and 
backward direction, and modify it, within little space. 

Theorem 7 A dynamic directed graph consisting of $n$ nodes and $e$ edges can support the operations
of counting and listing the neighbors pointed from a node, counting and listing the reverse neighbors
pointing to a node, and telling whether there is a link from one node to another, all in time
$O(\lg(n + e)/\lg\lg(n + e))$ per delivered datum. Edges can also be added and deleted in amortized time
$O(\lg(n + e)/\lg\lg(n + e))$. The space required is $eH + n + e + o(n + e + eH) + O(n(\lg n + \lg^{1+\varepsilon} e))$, where $\varepsilon$ is
any constant and $H = \sum_{1 \le i \le n} (e_i/e) \lg(e/e_i) \le \lg n$, where $e_i$ is the outdegree of node $i$. It is also
possible to handle insertions and deletions of nodes (deletions need that the node is disconnected
from the graph), coming from a universe $[1..N]$, within $O(\lg\lg N)$ additional time per operation
and $O(n \lg N)$ extra bits of space.

Note that we can replace "outdegree" with "indegree" in the theorem by representing the transposed
graph, as operations are symmetric. We can similarly transpose general binary relations.

Inverted indexes. Finally, we consider an application where the symbols are words. Take a text
$T$ as a sequence of $n$ words, which are strings over a set of letters $\Gamma$. The alphabet of $T$ is $\Sigma = \Gamma^*$,
and its effective alphabet is called the vocabulary $V$ of $T$, of size $|V| = \sigma$. A positional inverted
index is a data structure that, given a word $w \in V$, tells the positions in $T$ where $w$ appears.

A well known way to simulate a positional inverted index within no extra space on top of the
compressed text is to use a compressed sequence representation for $T$ (over alphabet $\Sigma$), so that
operation $\mathrm{select}_w(T, i)$ simulates access to the $i$th position of the list of word $w$, whereas access to
the original $T$ is provided via $\mathrm{access}(T, i)$. Operation rank can be used to emulate various inverted
index algorithms, particularly for intersections [6]. The space is the zero-order entropy of the text
seen as a sequence of words, which is very competitive in practice. Our new technique permits
modifying the underlying text, that is, it simulates a dynamic inverted index. For this sake we use
the technique of Section 5.3 and a trie to handle the vocabulary.
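
As a toy illustration, the sketch below treats a list of words as the sequence and simulates posting-list access with select. The `next_occurrence` helper is our own example of how rank and select can be combined to jump forward in a list, a primitive that intersection algorithms commonly rely on; it is not taken from the cited work.

```python
def rank(words, w, i):
    """Occurrences of word w in words[0:i]."""
    return words[:i].count(w)


def select(words, w, k):
    """0-based position of the k-th (k >= 1) occurrence of w, or None."""
    seen = 0
    for pos, x in enumerate(words):
        if x == w:
            seen += 1
            if seen == k:
                return pos
    return None


def next_occurrence(words, w, pos):
    """First position >= pos where w occurs, or None (one rank plus one select)."""
    return select(words, w, rank(words, w, pos) + 1)
```

For the text "to be or not to be", the list of "to" is recovered as `select(words, "to", 1)` and `select(words, "to", 2)`, that is, positions 0 and 4.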

Theorem 8 A text of $n$ words with a vocabulary of size $\sigma$ can be represented within $nH_0(T) +
o(n(1 + H_0(T))) + O(\sigma(\lg \sigma + \lg^{1+\varepsilon} n))$ bits of space, where $\varepsilon > 0$ is an arbitrary constant and $H_0(T)$
is the word-wise entropy of $T$. The representation outputs any word $T[i] = w$ given $i$, finds the
position of the $i$th occurrence of any word $w$, and tells the number of occurrences of any word $w$ up
to position $i$, all in time $O(|w| + \lg n / \lg\lg n)$. A word $w$ can be inserted or deleted at any position
in $T$ in amortized time $O(|w| + \lg n/\lg\lg n)$.

Another kind of inverted index, a non-positional one, relates each word to the documents
where it appears (not to the exact positions). This can be seen as a direct application of our binary
relation representation [2], and our dynamization theorems apply to it as well.